3A: Data Frame

Readings

From R Coding Basics: An Introduction to the Basics of Coding in R by Dr. Gaston Sanchez:

Topics

  • Data frame and matrix

  • Creating a data frame

  • Selecting elements in a data frame

  • Adding, removing, and transforming a column

Data Frame

  • A matrix is a tabular data structure containing values of the same type.

  • A data frame is a matrix-like data structure in which each column may have a different type.

#   Age Height Weight Gender
# 1  25    168     65 Female
# 2  32    175     70   Male
# 3  28    160     55 Female
# 4  40    180     80   Male
  • To create a data frame, we can combine multiple vectors using data.frame().
age <- c(25, 32, 28, 40)
hei <- c(168, 175, 160, 180)
wei <- c(65, 70, 55, 80)
gen <- c("Female", "Male", "Female", "Male")

data.frame(age, hei, wei, gen)
  age hei wei    gen
1  25 168  65 Female
2  32 175  70   Male
3  28 160  55 Female
4  40 180  80   Male
  • It is possible to specify the column names.
age <- c(25, 32, 28, 40)
hei <- c(168, 175, 160, 180)
wei <- c(65, 70, 55, 80)
gen <- c('Female', 'Male', 'Female', 'Male')

data.frame(Age = age, Height = hei, Weight = wei, Gender = gen)
  Age Height Weight Gender
1  25    168     65 Female
2  32    175     70   Male
3  28    160     55 Female
4  40    180     80   Male
  • We can also type in the values directly.
data.frame(
  Age    = c(25, 32, 28, 40), 
  Height = c(168, 175, 160, 180), 
  Weight = c(65, 70, 55, 80), 
  Gender = c('Female', 'Male', 'Female', 'Male')
)
  Age Height Weight Gender
1  25    168     65 Female
2  32    175     70   Male
3  28    160     55 Female
4  40    180     80   Male
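
The difference between a matrix and a data frame described at the start of this section can be checked directly. The following is a minimal sketch, not part of the original notes: the names `m` and `people` are made up for illustration, and the character type reported for Gender assumes R 4.0 or later (older versions convert strings to factors by default).

```r
age <- c(25, 32, 28, 40)
gen <- c('Female', 'Male', 'Female', 'Male')

# A matrix holds a single type, so the numbers are coerced to character
m <- cbind(age, gen)
class(m[, 'age'])        # "character"

# A data frame keeps the type of each column
people <- data.frame(Age = age, Gender = gen)
sapply(people, class)    # Age is "numeric", Gender is "character"

# Basic structure of the data frame
dim(people)              # 4 rows and 2 columns
names(people)            # "Age" "Gender"
```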

💻 Hands-On

Create a data frame from the following data set

Student  Score  Allergy
Alice    92     peanut
Bob      88     none
Calvin   95     seafood
data.frame(
  Student = c('Alice', 'Bob', 'Calvin'), 
  Score   = c(92, 88, 95), 
  Allergy = c('peanut', 'none', 'seafood')
)
  Student Score Allergy
1   Alice    92  peanut
2     Bob    88    none
3  Calvin    95 seafood

Subsetting

The dollar operator $

  • A specific column (variable) can be quickly accessed using the dollar operator $

  • Note that the output is a vector, not a data frame.

df <- data.frame(
  Age    = c(25, 32, 28, 40), 
  Height = c(168, 175, 160, 180), 
  Weight = c(65, 70, 55, 80), 
  Gender = c('Female', 'Male', 'Female', 'Male')
)

df$Height
[1] 168 175 160 180
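
To confirm that the dollar operator returns a vector rather than a data frame, the result can be inspected as in the sketch below. The variable `h` is a hypothetical name used only here, and the double-bracket form is shown as an equivalent alternative that the notes above do not cover.

```r
df <- data.frame(
  Age    = c(25, 32, 28, 40),
  Height = c(168, 175, 160, 180),
  Weight = c(65, 70, 55, 80),
  Gender = c('Female', 'Male', 'Female', 'Male')
)

h <- df$Height
is.vector(h)        # TRUE
is.data.frame(h)    # FALSE

# Because h is an ordinary numeric vector, vector functions apply directly
mean(h)             # 170.75
max(h)              # 180

# Double brackets with the column name extract the same vector
identical(df[['Height']], df$Height)   # TRUE
```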

💻 Hands-On

Write R code to subset each individual column of df

#   Age Height Weight Gender
# 1  25    168     65 Female
# 2  32    175     70   Male
# 3  28    160     55 Female
# 4  40    180     80   Male
df$Age
[1] 25 32 28 40
df$Height
[1] 168 175 160 180
df$Weight
[1] 65 70 55 80
df$Gender
[1] "Female" "Male"   "Female" "Male"  

💻 Hands-On

Try the following R code to see what it returns.

df$Gender == 'Male'

df$Gender == 'Female'

df$Age < 30

df$Height[df$Gender == 'Male']

df$Height[df$Gender == 'Female']

df$Height[df$Age < 30]

In this activity, we use logical indexing to subset the observations that satisfy a given condition.

df$Gender == 'Male'
[1] FALSE  TRUE FALSE  TRUE
df$Gender == 'Female'
[1]  TRUE FALSE  TRUE FALSE
df$Age < 30
[1]  TRUE FALSE  TRUE FALSE
df$Height[df$Gender == 'Male']
[1] 175 180
df$Height[df$Gender == 'Female']
[1] 168 160
df$Height[df$Age < 30]
[1] 168 160
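
Logical indexing works because a comparison returns one TRUE or FALSE per row, and only the positions marked TRUE are kept. As a hedged illustration beyond the activity above, the sketch below stores the logical vector under a made-up name (`is_male`) and combines two conditions with `&` (and) and `|` (or).

```r
df <- data.frame(
  Age    = c(25, 32, 28, 40),
  Height = c(168, 175, 160, 180),
  Weight = c(65, 70, 55, 80),
  Gender = c('Female', 'Male', 'Female', 'Male')
)

# A logical vector can be stored and reused
is_male <- df$Gender == 'Male'
sum(is_male)      # 2, because each TRUE counts as 1
which(is_male)    # 2 4, the positions that are TRUE

# Both conditions must hold
young_and_heavy <- df$Age < 30 & df$Weight >= 60
df$Height[young_and_heavy]                      # 168

# At least one condition must hold
df$Height[df$Age < 30 | df$Gender == 'Male']    # 168 175 160 180
```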

💻 Hands-On

Write R code to find the following:

  • Weights of all male individuals

  • Ages of all female individuals

  • Heights of those whose weights are at least 65

df$Weight[df$Gender == 'Male']
[1] 70 80
df$Age[df$Gender == 'Female']
[1] 25 28
df$Height[df$Weight >= 65]
[1] 168 175 180

The brackets []

  • Certain elements of a data frame can be subset using the brackets []

From Dr. Gaston Sanchez

💻 Hands-On

Write R code to subset the indicated values of df

#   Age Height Weight Gender
# 1  25    168     65 Female
# 2  32    175     70   Male
# 3  28    160     55 Female
# 4  40    180     80   Male

# row 1, column 3

# row 4, column 4

# row 3, column 2

# row 1, column 3
df[1, 3]
[1] 65
# row 4, column 4
df[4, 4]
[1] "Male"
# row 3, column 2
df[3, 2]
[1] 160
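
Inside the brackets, the first index selects rows and the second selects columns. The short sketch below, which is an illustration rather than part of the original exercise, shows that each index may also be a range, in which case the result is a smaller data frame. The name `block` is hypothetical.

```r
df <- data.frame(
  Age    = c(25, 32, 28, 40),
  Height = c(168, 175, 160, 180),
  Weight = c(65, 70, 55, 80),
  Gender = c('Female', 'Male', 'Female', 'Male')
)

dim(df)        # 4 4: number of rows, then number of columns

# A single cell: row 2, column 3
df[2, 3]       # 70

# Rows 1 to 2 and columns 2 to 3 give a 2-by-2 data frame
block <- df[1:2, 2:3]
block
```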

Subsetting rows

  • Subsetting rows can be done using numeric indexing with the brackets []

From Dr. Gaston Sanchez

💻 Hands-On

Write R code to subset the indicated values of df

#   Age Height Weight Gender
# 1  25    168     65 Female
# 2  32    175     70   Male
# 3  28    160     55 Female
# 4  40    180     80   Male

# row 3

# rows 2 to 4 (2, 3, 4)

# rows 1, 2, and 4 only

# row 3
df[3, ]
  Age Height Weight Gender
3  28    160     55 Female
# rows 2 to 4 (2, 3, 4)
df[2:4, ]
  Age Height Weight Gender
2  32    175     70   Male
3  28    160     55 Female
4  40    180     80   Male
# rows 1, 2, and 4 only
df[c(1, 2, 4), ]
  Age Height Weight Gender
1  25    168     65 Female
2  32    175     70   Male
4  40    180     80   Male
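
Numeric row indexing also accepts negative numbers, which exclude rows instead of selecting them. This is a small sketch of that behavior and of using nrow() to reach the last row; it goes slightly beyond the exercise above, and the names `without_first` and `last_row` are made up for this example.

```r
df <- data.frame(
  Age    = c(25, 32, 28, 40),
  Height = c(168, 175, 160, 180),
  Weight = c(65, 70, 55, 80),
  Gender = c('Female', 'Male', 'Female', 'Male')
)

# A negative index drops that row and keeps the rest
without_first <- df[-1, ]
nrow(without_first)     # 3

# Drop rows 1 and 2, keeping rows 3 and 4
df[-c(1, 2), ]

# nrow() gives the number of rows, so this selects the last row
last_row <- df[nrow(df), ]
last_row$Age            # 40
```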

💻 Hands-On

Try the following R code to see what it returns.

df[df$Gender == 'Male', ]

df[df$Gender == 'Female', ]

df[df$Age < 30, ]

df[df$Gender == 'Male', ]
  Age Height Weight Gender
2  32    175     70   Male
4  40    180     80   Male
df[df$Gender == 'Female', ]
  Age Height Weight Gender
1  25    168     65 Female
3  28    160     55 Female
df[df$Age < 30, ]
  Age Height Weight Gender
1  25    168     65 Female
3  28    160     55 Female

💻 Hands-On

Write R code to find all individuals who are:

  • Older than 25

  • Shorter than 170

  • Heavier than 60

df[df$Age > 25, ]
  Age Height Weight Gender
2  32    175     70   Male
3  28    160     55 Female
4  40    180     80   Male
df[df$Height < 170, ]
  Age Height Weight Gender
1  25    168     65 Female
3  28    160     55 Female
df[df$Weight > 60, ]
  Age Height Weight Gender
1  25    168     65 Female
2  32    175     70   Male
4  40    180     80   Male
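
The row filters above each use a single condition. As a sketch of how they might be combined, the example below joins two conditions with `&` (and) or `|` (or) inside the row position of the brackets. The names `both` and `either` are hypothetical.

```r
df <- data.frame(
  Age    = c(25, 32, 28, 40),
  Height = c(168, 175, 160, 180),
  Weight = c(65, 70, 55, 80),
  Gender = c('Female', 'Male', 'Female', 'Male')
)

# Older than 25 AND shorter than 170: only row 3 qualifies
both <- df[df$Age > 25 & df$Height < 170, ]
both

# Male OR lighter than 60: rows 2, 3, and 4 qualify
either <- df[df$Gender == 'Male' | df$Weight < 60, ]
nrow(either)    # 3
```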

Subsetting columns

From Dr. Gaston Sanchez

💻 Hands-On

Write R code to subset the indicated values of df

#   Age Height Weight Gender
# 1  25    168     65 Female
# 2  32    175     70   Male
# 3  28    160     55 Female
# 4  40    180     80   Male

# column 1

# columns 1 to 3 (1, 2, 3)

# columns 1, 3, and 4 only

# column 1
df[, 1]
[1] 25 32 28 40
# columns 1 to 3 (1, 2, 3)
df[, 1:3]
  Age Height Weight
1  25    168     65
2  32    175     70
3  28    160     55
4  40    180     80
# columns 1, 3, and 4 only
df[, c(1, 3, 4)]
  Age Weight Gender
1  25     65 Female
2  32     70   Male
3  28     55 Female
4  40     80   Male
  • Another way to subset column(s) of a data frame is to use the brackets [] with column name(s). If more than one column name is given, the output is a data frame.
df[, c('Age', 'Gender')]
  Age Gender
1  25 Female
2  32   Male
3  28 Female
4  40   Male
  • If only one column name is given, the output is a vector.
df[, 'Height']
[1] 168 175 160 180
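
If a single column should stay a data frame instead of collapsing to a vector, the `drop = FALSE` argument of the brackets can be used. The sketch below illustrates this; it is an addition to the notes, and `h_vec` and `h_df` are made-up names.

```r
df <- data.frame(
  Age    = c(25, 32, 28, 40),
  Height = c(168, 175, 160, 180),
  Weight = c(65, 70, 55, 80),
  Gender = c('Female', 'Male', 'Female', 'Male')
)

# Default behavior: one column is simplified to a vector
h_vec <- df[, 'Height']
is.data.frame(h_vec)     # FALSE

# drop = FALSE keeps the result as a one-column data frame
h_df <- df[, 'Height', drop = FALSE]
is.data.frame(h_df)      # TRUE
dim(h_df)                # 4 1

# Brackets without the comma also return a one-column data frame
is.data.frame(df['Height'])   # TRUE
```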

💻 Hands-On

Write R code to subset the indicated column(s) of df using the brackets []

#   Age Height Weight Gender
# 1  25    168     65 Female
# 2  32    175     70   Male
# 3  28    160     55 Female
# 4  40    180     80   Male

# Height, Weight

# Gender, Weight, Age

# Gender

# Age

# Height, Weight
df[, c('Height', 'Weight')]
  Height Weight
1    168     65
2    175     70
3    160     55
4    180     80
# Gender, Weight, Age
df[, c('Gender', 'Weight', 'Age')]
  Gender Weight Age
1 Female     65  25
2   Male     70  32
3 Female     55  28
4   Male     80  40
# Gender
df[, 'Gender']
[1] "Female" "Male"   "Female" "Male"  
# Age
df[, 'Age']
[1] 25 32 28 40

Adding a column

  • Using the dollar operator $ is the easiest way to add a column
df <- data.frame(
  Age    = c(25, 32, 28, 40), 
  Height = c(168, 175, 160, 180), 
  Weight = c(65, 70, 55, 80), 
  Gender = c('Female', 'Male', 'Female', 'Male')
)

df$Home <- c('Pitman', 'Glassboro', 'Clayton', 'Deptford')
df
  Age Height Weight Gender      Home
1  25    168     65 Female    Pitman
2  32    175     70   Male Glassboro
3  28    160     55 Female   Clayton
4  40    180     80   Male  Deptford
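
A new column must match the number of rows in the data frame. The sketch below illustrates three cases: a single value that R repeats for every row, a column computed from an existing one, and a vector of the wrong length, which raises an error. It uses a separate data frame named `demo` (a hypothetical name) so that `df` above is left untouched; the columns Country, AgeInMonths, and Bad are invented for illustration.

```r
demo <- data.frame(
  Age    = c(25, 32, 28, 40),
  Gender = c('Female', 'Male', 'Female', 'Male')
)

# A single value is repeated for every row
demo$Country <- 'USA'

# A new column can be computed from an existing one
demo$AgeInMonths <- demo$Age * 12
demo$AgeInMonths     # 300 384 336 480

# A vector of the wrong length (3 values for 4 rows) raises an error,
# which tryCatch() converts into a message here
result <- tryCatch({
  demo$Bad <- c(1, 2, 3)
  'column added'
}, error = function(e) 'error: wrong length')
result               # "error: wrong length"
```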

💻 Hands-On

Create a new data frame as follows:

#>   Age Height Weight Gender      Home Married Savings
#> 1  25    168     65 Female    Pitman    TRUE   10000
#> 2  32    175     70   Male Glassboro    TRUE   25000
#> 3  28    160     55 Female   Clayton   FALSE    7000
#> 4  40    180     80   Male  Deptford    TRUE   40000
df$Married <- c(TRUE, TRUE, FALSE, TRUE)
df$Savings <- c(10000, 25000, 7000, 40000)
df
  Age Height Weight Gender      Home Married Savings
1  25    168     65 Female    Pitman    TRUE   10000
2  32    175     70   Male Glassboro    TRUE   25000
3  28    160     55 Female   Clayton   FALSE    7000
4  40    180     80   Male  Deptford    TRUE   40000

Removing a column

  • Removing a column can be done with the dollar operator $ as follows:
df$Home <- NULL
df
  Age Height Weight Gender Married Savings
1  25    168     65 Female    TRUE   10000
2  32    175     70   Male    TRUE   25000
3  28    160     55 Female   FALSE    7000
4  40    180     80   Male    TRUE   40000
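
Assigning NULL changes the data frame in place. As an alternative sketch, the brackets can build a copy without certain columns and leave the original unchanged. The example uses a separate data frame named `demo` so that `df` above is not affected; `no_home`, `keep`, and `small` are hypothetical names.

```r
demo <- data.frame(
  Age    = c(25, 32, 28, 40),
  Height = c(168, 175, 160, 180),
  Home   = c('Pitman', 'Glassboro', 'Clayton', 'Deptford')
)

# Keep every column whose name is not 'Home'; demo itself is unchanged
no_home <- demo[, names(demo) != 'Home']
names(no_home)    # "Age" "Height"
names(demo)       # "Age" "Height" "Home"

# Drop several columns at once by listing the names to exclude
keep  <- setdiff(names(demo), c('Height', 'Home'))
small <- demo[, keep, drop = FALSE]   # drop = FALSE keeps a data frame
names(small)      # "Age"
```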

💻 Hands-On

Remove the column Married in df

df$Married <- NULL
df
  Age Height Weight Gender Savings
1  25    168     65 Female   10000
2  32    175     70   Male   25000
3  28    160     55 Female    7000
4  40    180     80   Male   40000

Transforming a column

  • To transform a column, select it with the dollar operator $, compute the new values, and assign the result back to the same column.
df$Savings <- df$Savings / 1000
df
  Age Height Weight Gender Savings
1  25    168     65 Female      10
2  32    175     70   Male      25
3  28    160     55 Female       7
4  40    180     80   Male      40
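
The same pattern (select, compute, assign back) applies to other transformations. The sketch below is an illustration on a separate data frame named `demo`, a hypothetical name chosen so that `df` above is not modified. It converts Height from centimeters to meters and changes Gender to upper case.

```r
demo <- data.frame(
  Height = c(168, 175, 160, 180),
  Gender = c('Female', 'Male', 'Female', 'Male')
)

# Numeric transformation: centimeters to meters
demo$Height <- demo$Height / 100
demo$Height     # 1.68 1.75 1.60 1.80

# Character transformation: convert to upper case
demo$Gender <- toupper(demo$Gender)
demo$Gender     # "FEMALE" "MALE" "FEMALE" "MALE"
```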

💻 Hands-On

Convert Weight from kilograms to pounds (lbs)

df$Weight <- df$Weight * 2.20462
df
  Age Height   Weight Gender Savings
1  25    168 143.3003 Female      10
2  32    175 154.3234   Male      25
3  28    160 121.2541 Female       7
4  40    180 176.3696   Male      40